Less Is More: Picking Informative Frames for Video Captioning
Authors
Abstract
In video captioning, the best results have been achieved by attention-based models that associate salient visual components of a video with sentences. However, existing studies follow a common procedure of frame-level appearance and motion modeling on equally spaced sampled frames, which can introduce redundant visual information, sensitivity to content noise, and unnecessary computational cost. We propose PickNet, a plug-and-play network that picks informative frames for video captioning. Built on a standard Encoder-Decoder framework, we develop a reinforcement-learning-based procedure to train the network sequentially, where the reward of each frame-picking action is designed to maximize visual diversity and minimize textual discrepancy. If a candidate frame is rewarded, it is selected and the corresponding latent representation of the Encoder-Decoder is updated for future trials. This procedure continues until the end of the video sequence. Consequently, a compact frame subset can be selected to represent the visual information and perform video captioning without performance degradation. Experimental results show that our model can use only 6∼8 frames to achieve competitive performance on popular benchmarks.
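To make the picking procedure concrete, below is a minimal sketch (not the authors' implementation) of sequential frame picking driven by a visual-diversity reward. It assumes precomputed CNN frame features and replaces the learned reinforcement-learning policy with a simple greedy threshold rule; the function names, the diversity threshold, and the 8-frame budget are illustrative assumptions.

import numpy as np

def visual_diversity_reward(candidate, picked):
    """Score a candidate frame by its minimum cosine distance to the
    frames already picked (higher means more visually diverse)."""
    if not picked:
        return 1.0
    sims = [
        np.dot(candidate, p) / (np.linalg.norm(candidate) * np.linalg.norm(p))
        for p in picked
    ]
    return 1.0 - max(sims)

def pick_frames(frame_features, max_picks=8, reward_threshold=0.3):
    """Scan the video once in order; keep a frame whenever its diversity
    reward exceeds the threshold, up to a budget of max_picks frames."""
    picked, picked_idx = [], []
    for t, feat in enumerate(frame_features):
        if len(picked) >= max_picks:
            break
        if visual_diversity_reward(feat, picked) >= reward_threshold:
            picked.append(feat)
            picked_idx.append(t)
    return picked_idx

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Stand-in for 120 frames of 2048-d CNN features (e.g., from a ResNet encoder).
    feats = rng.normal(size=(120, 2048))
    print(pick_frames(feats))  # indices of the compact frame subset

In the actual PickNet, the pick/skip decision comes from a trained policy network rather than a fixed threshold, and the reward additionally includes the textual-discrepancy term computed against the generated caption, as described in the abstract.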
Similar references
A New Unequal Error Protection Technique Based on the Mutual Information of the MPEG-4 Video Frames over Wireless Networks
The performance of video transmission over wireless channels is limited by channel noise. Thus, many error-resilience tools have been incorporated into the MPEG-4 video compression method. In addition to these tools, the unequal error protection (UEP) technique has been proposed to protect the different parts of an MPEG-4 video packet with different channel coding rates based on the rate...
Deep Learning for Video Classification and Captioning
Accelerated by the tremendous increase in Internet bandwidth and storage space, video data has been generated, published and spread explosively, becoming an indispensable part of today's big data. In this paper, we focus on reviewing two lines of research aiming to stimulate the comprehension of videos with deep learning: video classification and video captioning. While video classification con...
Video to Text Summary: Joint Video Summarization and Captioning with Recurrent Neural Networks
Video summarization and video captioning are considered two separate tasks in existing studies. For longer videos, automatically identifying the important parts of video content and annotating them with captions will enable a richer and more concise condensation of the video. We propose a general neural network configuration that jointly considers two supervisory signals (i.e., an image-based v...
Indexed Captioned Searchable Videos: A Learning Companion for STEM Coursework
Videos of classroom lectures have proven to be a popular and versatile learning resource. A key shortcoming of the lecture video format is accessing the content of interest hidden in a video. This work meets this challenge with an advanced video framework featuring topical indexing, search, and captioning (ICS videos). Standard optical character recognition (OCR) technology was enhanced with im...
A New Wavelet Based Spatio-temporal Method for Magnification of Subtle Motions in Video
Video magnification is a computational procedure for revealing subtle variations across video frames that are invisible to the naked eye. A new spatio-temporal method that uses connectivity-based mapping of wavelet sub-bands is introduced here for magnifying small motions across video frames. In this method, the wavelet-transformed frames are first mapped to connectivity space a...